61 research outputs found

    LSOSS: Detection of Cancer Outlier Differential Gene Expression

    Detection of differential gene expression using microarray technology has received considerable interest in cancer research studies. Recently, many researchers discovered that oncogenes may be activated in some but not all samples in a given disease group. The existing statistical tools for detecting differentially expressed genes in a subset of the disease group mainly include cancer outlier profile analysis (COPA), outlier sum (OS), outlier robust t-statistic (ORT) and maximum ordered subset t-statistics (MOST). In this study, another approach named Least Sum of Ordered Subset Square t-statistic (LSOSS) is proposed. The results of our simulation studies indicated that LSOSS often has more power than previous statistical methods. When applied to real human breast and prostate cancer data sets, LSOSS was competitive in terms of the biological relevance of top ranked genes. Furthermore, a modified hierarchical clustering method was developed to classify the heterogeneous gene activation patterns of human breast cancer samples based on the significant genes detected by LSOSS. Three classes of gene activation patterns, which correspond to estrogen receptor (ER)+, ER− and a mixture of ER+ and ER−, were detected and each class was assigned a different gene signature

    A Comprehensive Analysis of Gene Expression Evolution Between Humans and Mice

    Evolutionary changes in gene expression account for most phenotypic differences between species. Advances in microarray technology have made the systematic study of gene expression evolution possible. In this study, gene expression patterns were compared between human and mouse genomes using two published methods. Specifically, we studied how gene expression evolution was related to GO terms and tried to decode the relationship between promoter evolution and gene expression evolution. The results showed that (1) the significant enrichment of biological processes in orthologs of expression conservation reveals functional significance of gene expression conservation. The more conserved gene expression in some biological processes than is expected in a purely neutral model reveals negative selection on gene expression. However, fast evolving genes mainly support the neutrality of gene expression evolution, and (2) gene expression conservation is positively but only slightly correlated with promoter conservation based on a motif-count score of the promoter alignment. Our results suggest a neutral model with negative selection for gene expression evolution between humans and mice, and promoter evolution could have some effects on gene expression evolution

    Longitudinal random effects models for genetic analysis of binary data with application to mastitis in dairy cattle

    A Bayesian analysis of longitudinal mastitis records obtained in the course of lactation was undertaken. Data were 3341 test-day binary records from 329 first lactation Holstein cows scored for mastitis at 14 and 30 days of lactation and every 30 days thereafter. First, the conditional probability of a sequence for a given cow was the product of the probabilities at each test-day. The probability of infection at time t for a cow was a normal integral, with its argument being a function of "fixed" and "random" effects and of time. Models for the latent normal variable included effects of: (1) year-month of test + a five-parameter linear regression function ("fixed", within age-season of calving) + genetic value of the cow + environmental effect peculiar to all records of the same cow + residual. (2) As in (1), but with five parameter random genetic regressions for each cow. (3) A hierarchical structure, where each of three parameters of the regression function for each cow followed a mixed effects linear model. Model 1 posterior mean of heritability was 0.05. Model 2 heritabilities were: 0.27, 0.05, 0.03 and 0.07 at days 14, 60, 120 and 305, respectively. Model 3 heritabilities were 0.57, 0.16, 0.06 and 0.18 at days 14, 60, 120 and 305, respectively. Bayes factors were: 0.011 (Model 1/Model 2), 0.017 (Model 1/Model 3) and 1.535 (Model 2/Model 3). The probability of mastitis for an "average" cow, using Model 2, was: 0.06, 0.05, 0.06 and 0.07 at days 14, 60, 120 and 305, respectively. Relaxing the conditional independence assumption via an autoregressive process (Model 2) improved the results slightly
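
The conditional-independence likelihood described in this abstract (the probability of a cow's sequence is the product of the test-day probabilities, each one a normal integral) can be sketched as follows. This is only an illustration: the function names and the liability values are hypothetical, and the argument of the normal integral is passed in directly rather than built from the fixed and random effects of Models 1 to 3.

```python
import math

def phi(x):
    """Standard normal CDF (the "normal integral")."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def sequence_probability(records, liabilities):
    """Probability of one cow's binary test-day sequence.

    records     -- 0/1 mastitis scores, one per test day
    liabilities -- hypothetical value of the normal-integral argument
                   at each test day (assumed given here)
    Assumes conditional independence across test days.
    """
    prob = 1.0
    for y, eta in zip(records, liabilities):
        p_infected = phi(eta)
        prob *= p_infected if y == 1 else 1.0 - p_infected
    return prob

# Illustrative inputs only; these numbers are not from the study.
print(sequence_probability([0, 0, 1, 0], [-1.6, -1.5, -1.4, -1.5]))
```

Working on the log scale (summing log-probabilities) would be the safer choice for long sequences.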

    Comparison of Computational Models for Assessing Conservation of Gene Expression across Species

    Assessing conservation/divergence of gene expression across species is important for the understanding of gene regulation evolution. Although advances in microarray technology have provided massive high-dimensional gene expression data, the analysis of such data is still challenging. To date, assessing cross-species conservation of gene expression using microarray data has been mainly based on comparison of expression patterns across corresponding tissues, or comparison of co-expression of a gene with a reference set of genes. Because direct and reliable high-throughput experimental data on conservation of gene expression are often unavailable, the assessment of these two computational models is very challenging and has not been reported yet. In this study, we compared one corresponding tissue based method and three co-expression based methods for assessing conservation of gene expression, in terms of their pair-wise agreements, using a frequently used human-mouse tissue expression dataset. We find that 1) the co-expression based methods are only moderately correlated with the corresponding tissue based methods, 2) the reliability of co-expression based methods is affected by the size of the reference ortholog set, and 3) the corresponding tissue based methods may lose some information for assessing conservation of gene expression. We suggest that the use of either of these two computational models to study the evolution of a gene's expression may be subject to great uncertainty, and the investigation of changes in both gene expression patterns over corresponding tissues and co-expression of the gene with other genes is necessary

    Genome wide association studies in presence of misclassified binary responses

    BACKGROUND: Misclassification has been shown to have a high prevalence in binary responses in both livestock and human populations. Leaving these errors uncorrected before analyses will have a negative impact on the overall goal of genome-wide association studies (GWAS), including reducing predictive power. A liability threshold model that contemplates misclassification was developed to assess the effects of mis-diagnostic errors on GWAS. Four simulated scenarios of case–control datasets were generated. Each dataset consisted of 2000 individuals and was analyzed with varying odds ratios of the influential SNPs and misclassification rates of 5% and 10%. RESULTS: Analyses of binary responses subject to misclassification resulted in underestimation of influential SNPs and failed to estimate the true magnitude and direction of the effects. Once the misclassification algorithm was applied there was a 12% to 29% increase in accuracy, and a substantial reduction in bias. The proposed method was able to capture the majority of the most significant SNPs that were not identified in the analysis of the misclassified data. In fact, in one of the simulation scenarios, 33% of the influential SNPs were not identified using the misclassified data, compared with the analysis using the data without misclassification. However, using the proposed method, only 13% were not identified. Furthermore, the proposed method was able to identify with high probability a large portion of the truly misclassified observations. CONCLUSIONS: The proposed model provides a statistical tool to correct or at least attenuate the negative effects of misclassified binary responses in GWAS. Across different levels of misclassification probability as well as odds ratios of significant SNPs, the model proved to be robust. In fact, SNP effects and the misclassification probability were accurately estimated, and the truly misclassified observations were identified with high probabilities compared to non-misclassified responses. This study was limited to situations where the misclassification probability was assumed to be the same in cases and controls, which is not always the case based on real human disease data. Thus, it is of interest to evaluate the performance of the proposed model in that situation, which is the current focus of our research.

    A jackknife-like method for classification and uncertainty assessment of multi-category tumor samples using gene expression information

    BACKGROUND: The use of gene expression profiling for the classification of human cancer tumors has been widely investigated. Previous studies were successful in distinguishing several tumor types in binary problems. As there are over a hundred types of cancers, and potentially even more subtypes, it is essential to develop multi-category methodologies for molecular classification for any meaningful practical application. RESULTS: A jackknife-based supervised learning method called paired-samples test algorithm (PST), coupled with a binary classification model based on linear regression, was proposed and applied to two well known and challenging datasets consisting of 14 (GCM dataset) and 9 (NCI60 dataset) tumor types. The results showed that the proposed method improved the prediction accuracy of the test samples for the GCM dataset, especially when the t-statistic was used in the primary feature selection. For the NCI60 dataset, the application of PST improved prediction accuracy when the number of genes used was relatively small (100 or 200). These improvements made the binary classification method more robust to the gene selection mechanism and the number of genes used. The overall prediction accuracies were competitive in comparison to the most accurate results obtained by several previous studies on the same datasets and with other methods. Furthermore, the relative confidence R(T) provided a unique insight into the sources of the uncertainty shown in the statistical classification and the potential variants within the same tumor type. CONCLUSION: We proposed a novel bagging method for the classification and uncertainty assessment of multi-category tumor samples using gene expression information. The strengths were demonstrated in the application to two benchmark datasets.

    Comparison of Two Output-Coding Strategies for Multi-Class Tumor Classification Using Gene Expression Data and Latent Variable Model as Binary Classifier

    Multi-class cancer classification based on microarray data is described. A generalized output-coding scheme based on One Versus One (OVO) combined with a Latent Variable Model (LVM) is used. Results from the proposed OVO output-coding strategy are compared with those obtained from the generalized One Versus All (OVA) method, and the efficiency of both strategies for multi-class tumor classification is studied. This comparative study was done using two microarray gene expression datasets: the Global Cancer Map (GCM) dataset and a brain cancer (BC) dataset. Primary feature selection was based on fold change and penalized t-statistics. Evaluation was conducted with varying feature numbers. The OVO coding strategy worked quite well with the BC data, while OVO and OVA results seemed to be similar for the GCM data. The selection of output coding methods for combining binary classifiers for multi-class tumor classification depends on the number of tumor types considered, the discrepancies between the tumor samples used for training, and the heterogeneity of expression within the cancer subtypes used as training data.
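
To make the difference between the two coding strategies concrete, the sketch below enumerates the binary tasks that the standard OVA and OVO decompositions create. It is a generic illustration, not the generalized scheme or the LVM classifier of the study; the function names and class labels are hypothetical, and the 14 classes correspond to the number of tumor types that another abstract in this listing gives for the GCM dataset.

```python
from itertools import combinations

def ova_tasks(classes):
    """One binary task per class: that class versus all the others."""
    return [(c, tuple(o for o in classes if o != c)) for c in classes]

def ovo_tasks(classes):
    """One binary task per unordered pair of classes."""
    return list(combinations(classes, 2))

# 14 placeholder labels standing in for tumor types.
classes = [f"tumor_{k}" for k in range(1, 15)]

print(len(ova_tasks(classes)), len(ovo_tasks(classes)))  # prints "14 91"
```

With K classes, OVA trains K classifiers on all samples, while OVO trains K(K-1)/2 classifiers, each on the samples of only two classes.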

    Assessment of heterogeneity of residual variances using changepoint techniques

    Several studies using test-day models show clear heterogeneity of residual variance along lactation. A changepoint technique to account for this heterogeneity is proposed. The data set included 100 744 test-day records of 10 869 Holstein-Friesian cows from northern Spain. A three-stage hierarchical model using the Wood lactation function was employed. Two unknown changepoints at times $T_1$ and $T_2$ ($0 < T_1 < T_2 < t_{max}$), with continuity of residual variance at these points, were assumed. Also, a nonlinear relationship between residual variance and the number of days of milking $t$ was postulated. The residual variance at time $t$ in lactation phase $i$, $\sigma^2_{e}(t)$, was modeled as $\sigma^2_{e}(t) = t^{\lambda_i}\,\sigma^2_{e_i}$ for $i = 1, 2, 3$, where $\lambda_i$ is a phase-specific parameter. A Bayesian analysis using Gibbs sampling and the Metropolis-Hastings algorithm for marginalization was implemented. After a burn-in of 20 000 iterations, 40 000 samples were drawn to estimate posterior features. The posterior modes of $T_1$, $T_2$, $\lambda_1$, $\lambda_2$, $\lambda_3$, $\sigma^2_{e_1}$, $\sigma^2_{e_2}$ and $\sigma^2_{e_3}$ were 53.2 and 248.2 days; 0.575, -0.406 and 0.797; and 0.702, 34.63 and 0.0455 kg², respectively. The residual variances predicted using these point estimates were 2.64, 6.88, 3.59 and 4.35 kg² at days of milking 10, 53, 248 and 305, respectively. This technique requires less restrictive assumptions and the model has fewer parameters than other methods proposed to account for the heterogeneity of residual variance during lactation.
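
The point estimates quoted in this abstract can be checked numerically. The sketch below assumes a power-function form, residual variance = t^λ × a phase-specific scale, which is inferred from the reported numbers rather than taken from the paper's text; the function names are hypothetical. Under that assumption the predictions at days 10, 53 and 305 match the reported 2.64, 6.88 and 4.35 kg², and the value at day 248 comes out near, but not exactly at, the reported 3.59 kg².

```python
# Posterior modes as quoted in the abstract.
T1, T2 = 53.2, 248.2                  # changepoints (days)
LAMBDAS = (0.575, -0.406, 0.797)      # phase-specific exponents
SCALES = (0.702, 34.63, 0.0455)       # phase-specific scale terms (kg^2)

def lactation_phase(t):
    """Index (0, 1 or 2) of the lactation phase containing day t."""
    if t <= T1:
        return 0
    if t <= T2:
        return 1
    return 2

def residual_variance(t):
    """Residual variance at day t under the ASSUMED form t**lambda_i * scale_i."""
    i = lactation_phase(t)
    return t ** LAMBDAS[i] * SCALES[i]

for day in (10, 53, 248, 305):
    print(day, round(residual_variance(day), 2))
```

The small gap at day 248 is plausibly due to rounding of the published modes, since the estimates are reported to only three or four significant figures.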